Abstract

The development of reliable distributed software is simplified by the ability to assume
a fail-stop failure model. We discuss the emulation of such a model in an asynchronous
distributed environment. The solution we propose, called Strong-GMP, can
be  supported  through  a  highly  efficient  protocol,  and  has  been  implemented  as  part 
of  a  distributed  systems  software  project  at  Cornell  University.  Here,  we  focus  on  the 
precise  definition  of  the  problem,  the  protocol,  correctness  proofs,  and  an  analysis  of 
costs. 
1 Introduction


The  development  of  distributed  software  is  greatly  simplified  in  environments  where  process 
and  communication  failures  are  benign.  For  this  reason,  it  is  common  for  distributed  systems 
to  be  developed  under  the  assumption  that  the  communication  network  does  not  partition 
and that processes are fail-stop [19, 20]: that they fail only by halting, and that these
failures  are  detected  accurately. 

Unfortunately,  real  distributed  environments  are  not  entirely  benign  in  these  respects. 
On  the  one  hand,  the  assumption  that  programs  fail  by  halting  can  be  satisfied  to  a  good 
approximation by careful development methodology and testing. Similarly, most communication
failures, such as message loss, corruption, out-of-order delivery, and replay, can be
detected and corrected at low cost, again with high probability. However, this is not the case
for  failure  detection  and  network  partitions.  Communication  partitions  are  unavoidable  in 
networks,  and  when  they  occur,  may  mimic  process  failures. 

It  is  well  known  that  the  consensus  problem  cannot  be  solved  in  asynchronous  systems 
subject to process failures [10], and this is often taken to mean that software for realistic
environments must live with some risk of inconsistent failure detection. A related result exists
for the database commit problem in the presence of partition failures [21]. A consequence is
that  a  great  deal  of  the  ‘fault-tolerant’  distributed  software  used  in  contemporary  networks 
is  at  risk  of  some  form  of  inconsistent  or  incorrect  behavior  when  an  action  is  based  on  the 
apparent  detection  of  a  process  failure. 

That  such  inconsistencies  are  not  very  noticeable  testifies  to  the  ingenuity  of  systems 
developers  in  building  systems  for  which  inconsistency  is  not  a  fatal  condition,  but  also 
to  the  extremely  limited  use  of  genuinely  distributed  programs  in  modern  networks.  Most 
distributed  software  is  based  on  one-time  interactions  between  a  client  program  and  a  server: 
it  is  very  uncommon  to  see  distributed  systems  in  which  any  form  of  continually  evolving 
distributed  state  is  shared  among  multiple  processes.  In  client-server  systems,  it  is  uncommon 
that  the  detailed  behavior  of  different  programs  would  be  compared;  hence,  inconsistencies 
in  how  programs  report  and  react  to  failures  might  not  affect  the  ‘distributed’  computation, 
much  less  be  noticed  by  a  casual  observer. 

Unfortunately, the need to develop fault-tolerant distributed software with nontrivial
distributed state is seen more and more often in modern computer
applications. One of us (Birman), through work with a distributed programming
environment called Isis [6, 5], has gained experience with a wide range of complex distributed
applications in settings such as telecommunications, factory automation, finance, scientific
computing and the military. In these domains one finds problems that are inherently
distributed and require fault-tolerance, and also in which complex distributed state is needed
to  operate  the  desired  system  correctly.  For  example,  a  telecommunications  system  must 
react  to  failures  of  switching  nodes  in  a  consistent  manner;  inability  to  do  this  can  cause  the 
system  to  deny  service.  A  brokerage  system  may  need  to  provide  trading  advice,  based  on 
changing  market  conditions,  to  multiple  traders.  If  two  analytic  servers  are  started  because 
some  parts  of  the  system  incorrectly  sense  a  primary  server  as  having  failed,  different  traders 
may  be  given  differing,  inconsistent  advice.  In  settings  such  as  these,  inconsistent  behavior 
can  have  significant  implications  and  cannot  be  tolerated. 

Similarly,  modern  distributed  operating  systems  exhibit  features  that  require  accurate 
failure  detection.  For  example,  the  Mach  operating  system  [15]  identifies  communication 
endpoints  using  an  abstraction  called  the  communication  port.  Each  port  has  a  single  “receive 
right”,  bound  to  one  process.  Rights  to  send  data  to  a  port  can  be  passed  among  processes, 
and  are  carefully  tracked  by  Mach.  Mach  guarantees  that  communication  to  a  port  will  be 
reliable:  if  a  successful  outcome  is  reported  to  the  sender,  the  message  will  not  be  lost  unless 
the destination fails, and a failure is reported only if the destination is faulty. Additionally,
Mach notifies the holder of a receive right when all holders of send rights have deleted them
(or  failed),  and  notifies  the  holder  of  a  send  right  if  the  corresponding  receiver  fails. 

Mach  is  widely  cited  for  its  simple  and  powerful  communications  model,  and  has  emerged 
as  an  industry  standard.  However,  it  is  easy  to  see  that  this  model  cannot  be  implemented 
in  a  way  that  is  both  safe  and  live:  the  only  “safe”  way  to  detect  a  failure  is  to  wait  for 
the  faulty  process  to  restart,  and  this  can  introduce  unbounded  delays!  At  the  time  of  this 
writing,  Mach  waits  for  failed  nodes  to  restart  before  reporting  failures,  hence  even  a  single 
failure  could  prevent  the  system  from  making  progress. 

Our  work  proposes  an  approach  which,  although  subject  to  limitations  stemming  from 
the  impossibility  results  cited  above,  is  nonetheless  extremely  powerful.  The  basic  idea  is 
to  substitute  a  logical  notion  of  system  membership  for  the  physical  notion  of  “operational” 
or  “failed”.  In  our  scheme,  application  programs  define  operational  processes  to  be  those 
listed  by  the  membership  service  and  failed  processes  to  be  those  not  listed  as  members  of  the 
system.  To  the  extent  that  the  membership  service  is  able  to  report  consistent  information  to 
processes  using  it,  those  processes  can  then  implement  consistent,  fault-tolerant  distributed 
algorithms. 

Our membership service assumes a low-level mechanism that monitors the status of processes.
The membership service excludes from the system any process that this mechanism
suspects of having failed. If the removed process has not crashed (i.e. the suspicion was
incorrect or due to a transient condition that corrected itself), subsequent communication
from it to the remainder of the system is inhibited. In this way we prevent a “zombie” process
from contradicting the abstraction presented to the remaining system processes. Lastly,
a  faulty  process  that  recovers  will  be  notified  that  it  has  been  dropped  from  the  system. 
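
The exclusion rule described above can be illustrated with a minimal sketch. The class `MembershipFilter` and its method names are hypothetical and are not part of the protocol presented in this paper; the sketch only assumes that each process consults a logical membership list before accepting a message.

```python
class MembershipFilter:
    """Hypothetical message filter driven by a logical membership list."""

    def __init__(self, members):
        self.members = set(members)   # processes currently listed as members
        self.excluded = set()         # identifiers removed from the list

    def exclude(self, suspect):
        """Remove a suspected process, whether or not it has really crashed."""
        self.members.discard(suspect)
        self.excluded.add(suspect)

    def is_operational(self, process):
        """'Operational' is a logical notion: listed by the membership service."""
        return process in self.members

    def accept(self, sender, payload):
        """Deliver a message only if its sender is still a member."""
        if sender in self.members:
            return payload
        return None  # sender was excluded (possibly a live "zombie"): drop


f = MembershipFilter({"p", "q", "r"})
f.exclude("q")                  # q is suspected, perhaps wrongly
print(f.accept("p", "hello"))   # delivered
print(f.accept("q", "stale"))   # dropped: q is no longer a member
```

Under this assumption, a mistakenly suspected process and a crashed process look the same to the remaining members, since neither can make itself heard.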

When physical partitions occur, our membership service prevents the system from logically
partitioning. More precisely, our scheme distinguishes the majority partition from
minority partitions. By defining the state of the majority partition to be the true system
state and limiting the actions permitted in a minority partition, logically consistent behavior
can be guaranteed even when a partition occurs. During periods when a majority partition
cannot  be  constituted,  our  scheme  might  treat  all  partitions  as  minority  ones,  effectively 
halting  the  system.  The  approach  is  thus  one  that  provides  rigorously  consistent  behavior 
at  all  times,  although  it  may  not  permit  progress  in  infrequent  situations  caused  by  severe 
network  partitions.1 
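
As a rough illustration of the majority rule, the following sketch tests whether the processes reachable in a partition form a strict majority of the current membership. The function name and the choice of "strict majority of the current view" as the quorum test are assumptions made for illustration; the precise condition used by the protocol is given later in the paper.

```python
def is_majority_partition(current_view, reachable):
    """Return True if `reachable` contains a strict majority of `current_view`."""
    view = set(current_view)
    live = view & set(reachable)
    return 2 * len(live) > len(view)


view = {"a", "b", "c", "d", "e"}
print(is_majority_partition(view, {"a", "b", "c"}))   # True: may proceed
print(is_majority_partition(view, {"d", "e"}))        # False: minority side

# An even split leaves no majority on either side, so the system halts.
even = {"a", "b", "c", "d"}
print(is_majority_partition(even, {"a", "b"}), is_majority_partition(even, {"c", "d"}))
```

Because two disjoint subsets of the same view cannot both be strict majorities, at most one partition can ever pass this test.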

As  an  example,  our  membership  service  could  be  used  to  overcome  the  failure  detection 
problems  currently  encountered  in  Mach.  Mach  could  be  made  both  safe  and  live  if  it  were 
modified  to  1)  report  ‘apparent’  failures  to  our  service  (thereby  making  Mach  one  of  our 
suspector  mechanisms),  2)  treat  machines  and  processes  as  faulty  only  when  our  service 
reported  them  as  such,  and  3)  restrict  communication  to  members  of  the  system  (as  defined 
by  our  membership  service).  The  Mach  communication  guarantees  would  then  be  satisfied 
even  in  networks  where  transient  disruptions  would  sometimes  cause  Mach  to  suspect  failures 
incorrectly.  We  believe  that  most  application  developers  would  prefer  the  environment  our 
service  provides  to  one  that  could  incur  indefinite  delays. 

Agreement on the membership of a group of processes in a distributed system is a classic
problem, and has been treated elsewhere. Relevant prior research includes solutions for
database contexts [4], real-time settings [8], and distributed control applications [12, 5].
Cristian [9] specified and solved a problem similar to the one we consider here, but in contrast
to our work, he considered a synchronous setting. Our approach and solution focus on the
asynchronous case, but differ from previous work on group membership for asynchronous
systems. The membership semantics provided by Virtual Partitions [1] are weaker, allowing
multiple membership views to exist simultaneously, and requiring neither atomicity nor
uniformity in committing new views. These semantics, however, reflect a desire to maintain
replicated  data  availability;  our  goal  is  to  provide  a  consistent,  unique  source  of  system-wide 
membership information. In contrast to [13, 2], which also permit multiple membership
views, we do not assume the existence of an underlying fault-tolerant atomic, ordered multicast.
The protocol of Mishra et al. [14] also relies on an ordered multicast. In these cases,
the potential membership must be a static set of processes so that the multicast ordering
properties can be maintained. This makes handling process recoveries more straightforward,
but still requires an additional mechanism to join newly created processes. We consider only
point-to-point communication and an arbitrary, unknown set of system processes. We handle
joins arising from both process recovery and process creation with the same mechanism.
The protocol in Birman and Joseph [5] blocks during periods when failures and recoveries
occur continuously. Our solution is fully “online”: we can process a constant flow of requests
to both remove and add processes, which is exactly what occurs in actual systems.

1 Through experience with several hundred Isis applications, we have observed that the most common
partition case involves a single processor disconnected from its LAN. Network “bridge” failures are uncommon
in LAN settings, and it makes sense to treat WAN systems differently from LANs because a WAN has
different performance characteristics. Other systems have adopted different approaches to this issue, however,
such as Dolev’s Transis project [3].

In Section 2 we describe our system model and the formal language we will use to specify
the  Strong  Group  Membership  Problem  (Strong  GMP).  In  Section  3  we  specify  Strong  GMP, 
and  in  Section  4  we  present  our  solution,  the  S-GMP  algorithm.  Section  5  gives  the  main 
part  of  the  inductive  correctness  proof  and  discusses  the  protocol’s  message  complexity  and 
minimality.  We  conclude  by  discussing  the  implications  of  our  particular  specification,  and 
directions  for  future  work. 

2  The  System  Model  and  Formal  Logic 

We consider only asynchronous distributed systems in which processes fail by crashing. Distributed
means that the processors are physically separated and that processes executing
in the system communicate only by passing messages along a fixed set of channels. Asynchronous
means that the system has no global clock, and that there are no bounds on relative
local clock speeds, execution speeds, or message transmission delays. The asynchrony assumption
is realistic: system load, network traffic, and any other dynamic components of
the  system  that  affect  performance  all  conspire  to  violate  synchronization  assumptions. 

Before defining the abstract computational model, we discuss the goals and effect of our
membership  service  for  processes  in  asynchronous  systems. 

2.1  Membership  Service  Goals 

This  paper  focuses  on  the  events  that  occur  at  processes  after  the  membership  service  is 
already  established  (Section  4.2.4  discusses  cold-starting  the  service).  Our  goal  in  building 
this  membership  service  is  to  provide  processes  in  an  asynchronous  system  subject  to  halting 
failures,  with  an  execution  environment  indistinguishable  from  a  synchronous,  halting-failure 
system.  Here,  the  term  “indistinguishable”  refers  to  the  sequence  of  events  observed  by  a 
process while it is a member of the system. The situation for a process excluded from the
system is discussed below.

Our  solution  has  the  property  that  when  a  process,  p,  learns  from  the  membership  service 
that  another  process,  q,  is  no  longer  a  member  of  the  system,  p  can  identify  an  event  in  its 
execution after which it will never receive another message from q. For p, this is indistinguishable
from q crashing and the membership service detecting it accurately. Moreover, our
solution  constructs  a  consistent  cut  [7]  along  which  every  other  functioning  member  of  the 
system  will  also  learn  that  q  is  excluded.  Consequently,  p  can  take  actions  that  depend  both 
on  q  having  crashed  and  on  all  other  processes  learning  this  concurrently  (just  as  it  could 
in  a  synchronous  environment).  In  a  normal  asynchronous  system,  p  would  have  neither 
guarantee. 

In our model, a new process, p, must join the system via the membership service before it
can  interact  with  other  processes.  The  service  responds  with  the  current  system  membership 
list,  and  thereafter  keeps  p  informed  of  each  change  to  the  list.  For  as  long  as  p  remains  on 
the  list,  it  can  send  messages  to  all  other  listed  processes,  and  communication  appears  to 
be  reliable  and  FIFO.2  Finally,  our  work  has  the  property  that  all  members  of  the  system 
observe  exactly  the  same  sequence  of  membership  changes  (join  and  leave  events),  even  when 
members  of  the  membership  service  itself  fail  or  join.  Elsewhere  [18]  we  show  how  this  strong, 
same-sequence  property  both  simplifies  distributed  algorithms  that  take  actions  based  upon 
membership  changes,  and,  somewhat  paradoxically,  actually  helps  in  reducing  the  cost  of 
the  membership  service  protocol  itself. 

Processes  that  genuinely  fail  do  so  by  halting.  We  require  that  such  a  process  is  eventually 
suspected  of  having  failed,  and  then  removed  from  the  system  list.3  Of  course,  no  live  failure 
detection  protocol  for  asynchronous  systems  can  avoid  mistakenly  suspecting  an  operational 
process  and  then  removing  it  from  the  membership  list  [10].  Because  exclusion  from  the 
membership  list  will  be  equated  with  failure,  such  exclusions  must  result  in  executions  that 
are  consistent  with  those  in  which  the  excluded  process  had,  in  fact,  failed.  Specifically,  we 
must  suppress  communication  from  a  process  that  has  been  erroneously  excluded.  To  this 
end, in addition to FIFO and channel reliability assumptions, we assume processes sever
communication paths with all others they believe faulty.4

2 It is well known that an underlying message transport system that uses sequence numbers, acknowledgements,
and retransmission can overcome message loss, duplication, and out-of-order arrival.

3 While we are not concerned with the implementation of the failure suspicion module, this can be quite
inexpensive. In particular, because our membership service places a uniformly observed ranking on system
members it is not necessary that every process monitor every other process. For example, the scheme used in
the Isis system requires each process to monitor only the next-highest ranked process. This seems to imply
a linear cost, but because network speeds are very high the dominant cost turns out to be the overhead
imposed on processes, which is constant and unrelated to system size in this case.

From the point of view of the excluded process, say q, it can no longer communicate with
other processes, but it can continue local computations. To illustrate the issues, suppose q had
been the token holder in a protocol that orders multicasts among a subset, S, of processes.
Upon learning of q's failure, the remaining processes in S determine a new token holder, say
q', although q will continue believing it is the token holder. Since q can no longer make
its message ordering known to S, the fact that the orderings of q and q' may differ does not
violate the (observable) correctness of the message-ordering protocol. That q will “observe”
a different ordering than the rest of S is irrelevant.

2.2  System  Requirements  and  Model  Assumptions 

To  implement  the  FIFO  and  channel  reliability  properties  we  require  two  things  of  the 
physical  system.  First,  each  message  sent  along  a  channel  must  have  a  non-zero  probability 
of reaching its destination intact, and second, each process must have a local, monotonically
increasing clock (i.e. a counter). These two requirements suffice to implement live failure
suspectors,  and  a  completely-connected  network  of  reliable,  FIFO  channels.  Our  protocols 
will  assume  this  complete  package  of  communication  guarantees,  but  we  are  not  concerned 
with  how  they  are  implemented. 

As soon as one process, p, suspects another of having failed, it Disconnects all its communication
channels with the suspected process. Moreover, to hide an erroneous suspicion as quickly as
possible, p Gossips its suspicion (for example, by piggybacking) to all other
processes in further communication, whereupon recipients adopt p’s belief and also disconnect
themselves from the suspected process. The Gossip and Disconnect actions combine to
isolate  suspected-faulty  processes  among  processes  not  believing  each  other  faulty. 
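
The interplay of the Gossip and Disconnect actions can be sketched as follows. This is only an illustration under simplifying assumptions: the class `Process` and its methods are hypothetical, and message delivery is modeled as a direct method call rather than an asynchronous channel.

```python
class Process:
    """Hypothetical process that disconnects from suspects and gossips suspicions."""

    def __init__(self, name):
        self.name = name
        self.suspected = set()   # processes this one believes faulty
        self.delivered = []      # (sender, payload) pairs accepted so far

    def suspect(self, other_name):
        """Disconnect: stop all communication with the suspected process."""
        self.suspected.add(other_name)

    def send(self, dest, payload):
        """Send to `dest`, piggybacking current suspicions on the message."""
        if dest.name in self.suspected:
            return False                      # channel was disconnected
        return dest.receive(self.name, payload, frozenset(self.suspected))

    def receive(self, sender, payload, gossip):
        if sender in self.suspected:
            return False                      # reject traffic from a suspect
        self.suspected |= gossip              # Gossip: adopt the sender's beliefs
        self.delivered.append((sender, payload))
        return True


p, q, r = Process("p"), Process("q"), Process("r")
p.suspect("q")                 # p suspects q, perhaps mistakenly
p.send(r, "m1")                # r learns of the suspicion via piggybacked gossip
print(q.send(r, "m2"))         # False: r has disconnected from q
print(q.send(p, "m3"))         # False: p disconnected from q earlier
```

Once p has communicated with r, the still-running process q can no longer influence either of them, which is the isolation property the text relies on.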

The impossibility result of [10] can be interpreted as forcing applications in asynchronous
systems  to  either  make  accurate  failure  detections  or  be  live.  By  choosing  liveness,  we 
admit  the  possibility  of  erroneous  failure  detections,  but  by  isolating  mistakenly  suspected 
processes,  we  prevent  them  from  further  affecting  the  global  system.  As  a  result,  q  halting 
and  q  mistakenly  suspected  to  have  halted  are  indistinguishable. 

4 Because the communication layer is asynchronous, messages from an excluded process may continue to
arrive,  and  be  rejected,  for  an  unbounded  period  of  time.  The  communication  layer  would  also  inform  an 
excluded  process  that  it  has  been  excluded,  causing  it  to  rejoin  the  system  under  a  new  process  identifier. 
The  protocols  needed  to  implement  such  a  transport  layer  are  evident  and  will  not  be  presented  here. 
2.3 The System Model

Denote by Proc a countable set of process identifiers, $\{p_1, p_2, \ldots\}$. The process name space
is infinite so that we can model infinite executions in which new processes continually arise.
However, because there can be only finitely many processors, and because process births
require  non-zero  time,  the  number  of  processes  extant  at  any  real  time  in  an  execution  will 
always  be  finite. 

Processes may send and receive messages, and do internal computation. The event
$\mathit{send}_p(q, m)$ denotes $p$ sending message $m$ to $q$, and $\mathit{recv}_q(p, m)$ denotes $q$'s receipt of $m$
from $p$. The distinct internal event $\mathit{crash}_p$ models the crash failure of process $p$, after which
only other $\mathit{crash}_p$ events are permitted. A history for process $p$, denoted $h_p$, is a sequence of
events executed by $p$, and must begin with the distinct, internal event $\mathit{start}_p$:

$$h_p \stackrel{\text{def}}{=} \mathit{start}_p \cdot e_p^1 \cdot e_p^2 \cdots e_p^k, \quad k \geq 0.$$

We write $e \in h_p$ when $e$ is an event of $h_p$. A cut is an $n$-tuple of process histories $c =
(h_{p_1}, h_{p_2}, \ldots, h_{p_n})$, where $p_i \in \mathit{Proc}$. We restrict our attention to cuts determined by finite
subsets of Proc since these represent the system's global state at some real time in its
execution. Each execution begins with the distinct cut $c_0 = \langle \mathit{start}_{p_1}, \mathit{start}_{p_2}, \ldots, \mathit{start}_{p_n} \rangle$.
We also write $e \in c$ to abbreviate “$e \in h_p$ for some $p$ mentioned in $c$”, and elaborate when
the  context  does  not  clearly  distinguish  the  intention. 

We  assume  familiarity  with  the  happens  before  relation  [11]  between  events  (written 
$e \rightarrow e'$), and also with consistent cuts [7]. Henceforth we restrict the discussion to consistent
cuts,  as  they  are  the  ones  that  are  physically  realizable.  Consistent  cuts  are  the  possible 
global  states  of  an  execution;  while  a  given  consistent  cut  may  never  have  existed  at  any 
point  in  real  time,  it  is  impossible  for  a  cut  that  is  not  causally  consistent  to  ever  exist  at 
any  point. 
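
For readers who prefer an operational view, the following sketch checks whether a cut is consistent. The representation is an assumption made for illustration: a cut is a dictionary mapping each process name to its history, and events are tuples such as `("send", dest, msg)` and `("recv", src, msg)`. Because every history is a prefix of a process's execution, it suffices to check that each receive in the cut has its matching send in the cut.

```python
def is_consistent(cut):
    """Return True if every message received in the cut was also sent in the cut."""
    for receiver, history in cut.items():
        for event in history:
            if event[0] == "recv":
                _, sender, msg = event
                # The matching send must appear in the sender's history.
                if ("send", receiver, msg) not in cut.get(sender, []):
                    return False
    return True


good = {
    "p": [("start",), ("send", "q", "m")],
    "q": [("start",), ("recv", "p", "m")],
}
bad = {
    "p": [("start",)],                       # p has not yet sent m ...
    "q": [("start",), ("recv", "p", "m")],   # ... but q has already received it
}
print(is_consistent(good))   # True
print(is_consistent(bad))    # False
```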

A  characterization  of  global  causality  should  incorporate  the  notion  of  progress  between 
global  states.  Specifically,  we  desire  that  every  process  either  makes  local  progress  or  remains 
stationary; none should regress. A process makes local progress between the cumulative states
represented by $h_p$ and $h'_p$ exactly when $h_p$ is a prefix of $h'_p$.

Definition Given $c = (h_1, \ldots, h_p, \ldots, h_n)$ and $c' = (h'_1, \ldots, h'_p, \ldots, h'_n)$, $c$ causally precedes
$c'$ (written $c \leq c'$) if and only if for each process $p$, either 1) $h_p = h'_p$; or 2) $h_p$ is a strict
prefix of $h'_p$.

Observe that there are (infinitely) many completions for any given cut. In this sense, the
future of any cut is uncertain; it may branch out in many directions. On the other hand,
$c \leq c'$ implies that any execution in which $c'$ is a prefix must also contain $c$ as a prefix.
Figure 1: $h_{p_i}$ is a strict prefix of $h'_{p_i}$ for each $p_i$, so $c \ll c'$.

Definition Let $c = (h_1, \ldots, h_p, \ldots, h_n)$ and $c' = (h'_1, \ldots, h'_p, \ldots, h'_n)$ be consistent cuts.
Then $c$ strictly precedes $c'$ (written $c < c'$) if and only if $c \leq c'$ and $c \neq c'$; the cut $c$ very strictly
precedes $c'$ (written $c \ll c'$) if and only if $h_p$ is a strict prefix of $h'_p$ for each $p$ mentioned
(Figure 1).
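
The three orderings on cuts (causally precedes, strictly precedes, and very strictly precedes) can be made concrete with a short sketch. The representation is an assumption for illustration: a cut is a dictionary from process names to histories (lists of events), and both cuts mention the same processes.

```python
def is_prefix(h, h2):
    """True if history h is a (possibly equal) prefix of history h2."""
    return len(h) <= len(h2) and list(h2[:len(h)]) == list(h)


def causally_precedes(c, c2):
    """Each history in c equals, or is a strict prefix of, its counterpart in c2."""
    return c.keys() == c2.keys() and all(is_prefix(c[p], c2[p]) for p in c)


def strictly_precedes(c, c2):
    """Causally precedes, and the two cuts differ."""
    return causally_precedes(c, c2) and c != c2


def very_strictly_precedes(c, c2):
    """Every process has made progress: each history is a strict prefix."""
    return causally_precedes(c, c2) and all(len(c[p]) < len(c2[p]) for p in c)


c0 = {"p": ["start"], "q": ["start"]}
c1 = {"p": ["start", "e1"], "q": ["start"]}          # only p has advanced
c2 = {"p": ["start", "e1"], "q": ["start", "e2"]}    # both have advanced

print(causally_precedes(c0, c0), strictly_precedes(c0, c0))        # True False
print(strictly_precedes(c0, c1), very_strictly_precedes(c0, c1))   # True False
print(very_strictly_precedes(c0, c2))                              # True
```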

2.4  The  Modal  Logic 

So  far,  our  description  of  Strong  GMP  refers  to  when  core  members  agree  on  the  group  view, 
as  well  as  the  degree  of  simultaneity  with  which  they  do  so.  A  temporal  modal  logic  allows 
us  to  express  these  notions.  Unique  to  our  logic  is  its  attention  to  asynchrony  -  the  basic 
semantic  entities  of  the  logic  are  consistent  cuts.  We  briefly  describe  the  temporal  modalities 
we  use  to  specify  Strong  GMP. 

Given a propositional formula, $\phi$, and the $<$ relation between cuts, the formula $\Box\phi$ holds
along cut $c$ precisely when $\phi$ holds along all future cuts in all runs containing $c$ (i.e. every
$c'$ such that $c < c'$). $\Diamond\phi$ holds along $c$ when $\phi$ will hold along some future cut in every run
containing $c$. We interpret $\Diamond$ as “inevitability”. The past-time counterparts are $\blacklozenge\phi$, which holds along $c$ if $\phi$ held at some $c' < c$,
and $\blacksquare\phi$, which holds if $\phi$ held along all $c' < c$.

3  Strong  Group  Membership 

We  now  formally  define  the  Strong  Group  Membership  Problem  for  asynchronous  systems. 
Our  definition  specifies  how  to  coordinate  local  events  among  a  group  of  processes  so  that  the 
group’s externally observed behavior is indistinguishable from that of a single, fault-tolerant
process. Thus, any solution to Strong GMP can be used to build a system membership
service (which we call a Membership Resource Manager, or MRM). In this section, and in
the rest of the paper, we restrict our focus to the core processes implementing the MRM,
the formal problem describing their actions, and the algorithm solving this problem. Thus,
we  describe  a  hierarchical  approach  to  building  a  Strong  GMP  membership  service,  in  that 
our  protocol  is  run  by  a  small  core  set  of  processes,  which  use  a  cheap  replication  scheme  (e.g. 
the  Isis  replication  tools)  to  maintain  a  fault-tolerant  member  list  for  the  overall  system. 

Creating  the  illusion  of  a  single  fault-tolerant  process  means  that  core  members  must 
agree,  not  only  on  the  entire  system  membership,  but  also  on  the  composition  of  the  MRM 
core.  A  core  member  that  fails  or  is  otherwise  removed  must  be  consistent  with  the  rest  of 
the  core  while  it  is  a  member.  More  important,  a  core  member  that  is  removed  from  the 
core  but  has  not  halted  must  not  be  able  to  misrepresent  the  system  state;  our  specification 
must  preclude  such  a  process  from  changing  its  local  view  of  the  system's  membership  or 
the  core's  membership  independently. 

3.1  Formal  Specification 

The formula $\mathrm{UP}_p$ holds along a cut if and only if $p$ has not executed $\mathit{crash}_p$ in its local history
component of that cut. Conversely, $\mathrm{DOWN}_p$ holds along $c$ exactly when $p$ has crashed in $c$.
The indexical set $\mathit{Up}(c)$ in an asynchronous run $A$ is the set of all processes that have not
crashed along $c$: $\mathit{Up}(c) = \{p \mid \mathrm{UP}_p \text{ holds along } c\}$.

Process $p$ executes the event $\mathit{faulty}_p(q)$ as soon as it suspects $q$ faulty; whether $p$ comes
to suspect $q$ through some local observation or through our Gossip assumption (Section 2.2)
is immaterial. Some time after recording $\mathit{faulty}_p(q)$, $p$ will execute the event $\mathit{remove}_p(q)$.
The distinction between these events is significant: $\mathit{faulty}_p(q)$ represents $p$'s belief in $q$'s
faultiness, which may be incorrect, while $\mathit{remove}_p(q)$ is the actual removal of $q$ from the set of core
members $p$ believes operational. The formula $\mathrm{FAULTY}_p(q)$ holds along all cuts that include
$\mathit{faulty}_p(q)$, and $\mathrm{REMOVE}_p(q)$ along all cuts that include $\mathit{remove}_p(q)$. Analogous statements
hold for events $\mathit{operating}_p(q)$ ($p$ believes $q$ is functional) and $\mathit{add}_p(q)$ ($p$ adds $q$ to the set
of core members), and formulas $\mathrm{OPERATING}_p(q)$ and $\mathrm{ADD}_p(q)$. In contrast to $\mathrm{FAULTY}_p(q)$,
$\mathrm{OPERATING}_p(q)$ is not stable.

The local membership view for process $p$ along cut $c = (h_1, \ldots, h_p, \ldots, h_n)$ (denoted
$\mathit{LocalView}_p(c)$) is the set of processes $p$ obtains by sequentially modifying its initial membership
list according to the $\mathit{remove}_p()$ and $\mathit{add}_p()$ events in $h_p$. We use $\mathit{LocalView}_p$ when the
cut is clear from context. Trivially, we require $p \in \mathit{LocalView}_p(c)$. The formula $\mathrm{IN\text{-}LOCAL}_p(q)$
holds along all cuts, $c$, such that $q \in \mathit{LocalView}_p(c)$. Because $h_p$ is linear, it makes sense to talk
about the $x$th version of $p$'s local view, which we denote $\mathit{LocalView}_p^x$. Finally, $\mathrm{IN\text{-}LOCAL}_p^x(q)$
holds when $q \in \mathit{LocalView}_p^x$. The formula $\mathrm{NOTDEF'D}(\mathit{LocalView}_p^x)$ holds along $c$ if $p$ has not
(yet) defined its $x$th local view.
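
The construction of local views from a history can be sketched as follows. The function name, the encoding of events as `(kind, target)` pairs, and the convention that version 0 is the initial membership list are assumptions made for illustration only.

```python
def local_view_versions(initial_members, history):
    """Return the successive versions of a local view, starting with the initial list.

    Only "remove" and "add" events change the view; other events
    (for example "faulty") record beliefs and leave the view untouched.
    """
    versions = [frozenset(initial_members)]
    for kind, target in history:
        current = set(versions[-1])
        if kind == "remove":
            current.discard(target)
        elif kind == "add":
            current.add(target)
        else:
            continue
        versions.append(frozenset(current))
    return versions


# History of p: it suspects q, later removes q, then adds a new process s.
history_p = [("faulty", "q"), ("remove", "q"), ("add", "s")]
versions = local_view_versions({"p", "q", "r"}, history_p)
for x, view in enumerate(versions):
    print(x, sorted(view))
```

Because the history of a process is a linear sequence, the versions are totally ordered, which is what makes the notion of an xth local view well defined.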

We extend local views to group views as follows. Given $S \subseteq \mathit{Proc}$, and a consistent cut
$c$, if the local views of all the functional processes in $S$ are identical, the group view is the
agreed-upon local view; if $S$ has no functioning members or if the functioning members of
$S$ have different local views, the group view is undefined. We say that $S$ determines a group
view. Formally:


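Definition Given a consistent cut $c$ and a set of processes, $S \subseteq \mathit{Proc}$, the group view
determined by $S$ along $c$ is:

$$\mathit{GpView}_S(c) \stackrel{\text{def}}{=} \begin{cases} \mathit{LocalView}_p(c) & \text{if } (S \cap \mathit{Up}(c)) \neq \emptyset \text{ and } \forall p, q \in (S \cap \mathit{Up}(c)) : \mathit{LocalView}_p(c) = \mathit{LocalView}_q(c) \\ \text{undefined} & \text{otherwise.} \end{cases}$$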


The formula $\mathrm{UNDEF'D}(\mathit{GpView}_S(c))$ holds along $c$ if the local views of any functional
members of $S$ disagree or if $S \cap \mathit{Up}(c) = \emptyset$.
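
The definition of the group view translates directly into a small function. In this sketch, which is illustrative only, `S` and `up` are sets of process names, `local_views` maps each process to its local view along the cut in question, and the Python value `None` stands for "undefined".

```python
def group_view(S, up, local_views):
    """Return the group view determined by S, or None if it is undefined."""
    functional = set(S) & set(up)          # members of S that have not crashed
    if not functional:
        return None                        # no functioning member of S
    views = {frozenset(local_views[p]) for p in functional}
    if len(views) != 1:
        return None                        # functioning members disagree
    return next(iter(views))               # the agreed-upon local view


S = {"p", "q", "r"}
views = {"p": {"p", "q", "r"}, "q": {"p", "q", "r"}, "r": {"p", "r"}}

print(group_view(S, {"p", "q"}, views) == frozenset({"p", "q", "r"}))  # True
print(group_view(S, {"p", "q", "r"}, views))   # None: r disagrees
print(group_view(S, set(), views))             # None: nobody in S is up
```

Note that a crashed member's stale local view does not affect the result: in the first call r is not in `up`, so its differing view is ignored.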

Constraining Membership in $\mathit{GpView}_S(c)$

The definition of $\mathit{GpView}_S(c)$ is crucial to the class of Group Membership Problems, so it
is worthwhile discussing how the sets $S$ and $\mathit{GpView}_S(c)$ relate. Recall that $\mathit{GpView}_S(c)$ is
the abstraction we are using to define the single, fault-tolerant process illusion that will be
used to build an MRM.5 In this light, MRM core members are precisely the members of
$\mathit{GpView}_S(c)$.

If $q \in (GpView_S(c) \setminus S)$ then q is a core MRM member whose local view is not used in
determining either the MRM composition or the total system membership. Specifically, q's
local view is not constrained by the definition of $GpView_S(c)$, so $LocalView_q(c)$ need not be
identical to $GpView_S(c)$. Because q replies to MRM client requests based on its local view,
its replies will contradict other core members' replies when $LocalView_q(c) \neq GpView_S(c)$: the
single-process illusion falls apart. Consequently, unless every core member's local view is
used to determine the MRM group view, the MRM cannot guarantee global consistency.

To avoid this, our specification forces q to be in S whenever it is in $GpView_S(c)$:

$$(GpView_S(c) \cap Up(c)) \subseteq (S \cap Up(c)).$$

The reverse inclusion follows trivially since $p \in LocalView_p(c)$. Consequently, our specification
requires $S = GpView_S(c)$.

5 In practice, the group view, and therefore each core member's local view, includes the entire system
membership in addition to the MRM composition; here we are only concerned with the MRM composition.

To  finish  the  single-process  illusion,  the  MRM  must  be  unique.  We  will  therefore  also 
require  that  there  be  at  most  one  set  satisfying  this  equality  along  any  consistent  cut.  Since 
there  can  only  be  one  MRM,  some  form  of  quorum  consensus  is  needed  to  change  the  global 
system  membership.  If  a  quorum  cannot  be  attained  (for  example  during  certain  partitions), 
no  solution  to  Strong  GMP  can  make  progress. 

Finally, $GpView_S(c)$ is defined if and only if the local views of all its functioning members
agree.  Processes  that  are  eventually  removed  from  the  core  are  not  excused  from  having 
consistent  views  while  they  are  members.  Moreover,  a  core  member  that  is  removed  but 
has  not  crashed  cannot  be  allowed  to  change  its  local  view.  Our  specification  captures 
these  safety  issues  in  two  clauses:  GMP-2  formalizes  the  Uniqueness  requirement  for  group 
views,  and  GMP-3,  by  requiring  every  local  view  to  exist  as  a  group  view,  prevents  excluded 
processes  from  taking  actions  unilaterally. 


3.2  Strong  GMP  Specification 

We  now  have  the  language  necessary  to  formalize  Strong  GMP.  Since  formulas  are  evaluated 
along  cuts  we  drop  references  to  cuts  in  indexical  sets. 


GMP-0 (Base Case) An initial group view eventually exists:

$$\Diamond \bigvee_{S_0 \subseteq Proc} \big( S_0 = GpView_{S_0}() \big).$$

GMP-1 (Validity) Processes do not make changes to their local views capriciously. For
example, if q were once, but is not currently, in $LocalView_p$, then p should believe q
faulty.

a. $\big( \underline{\Diamond}\,\text{IN-LOCAL}_p(q) \land \neg\text{IN-LOCAL}_p(q) \big) \Rightarrow \text{FAULTY}_p(q)$

b. $\big( \underline{\Diamond}\,\neg\text{IN-LOCAL}_p(q) \land \text{IN-LOCAL}_p(q) \big) \Rightarrow \text{OPERATING}_p(q)$.

In contrast to $\text{FAULTY}_p(q)$, $\text{OPERATING}_p(q)$ is not stable.

GMP-2 (Uniqueness) Non-null group views are unique along all consistent cuts.

$$\bigwedge_{S \subseteq Proc} \Big( \big( GpView_S() = S \big) \Rightarrow \bigwedge_{\emptyset \neq S' \neq S} \text{UNDEF'D}(GpView_{S'}()) \Big)$$
The formula $\text{IN-GP}_p$ holds along all cuts, c, such that $p \in GpView_S(c)$ (when it is
defined); $\text{OUT-GP}_p$ holds when $p \notin GpView_S(c)$ (also provided $GpView_S(c)$ is defined).
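The requirement that at most one set S satisfies S = GpView_S along a cut can be checked by brute force on a small snapshot. The following is a minimal sketch under the same hypothetical modeling as before (a cut is a mapping from processes to local views, plus the set of functional processes); the function names are our own and the exhaustive search is only practical for a handful of processes.

```python
from itertools import combinations


def group_view(s, local_views, up):
    """Agreed-upon local view of the functional members of s, or None if undefined."""
    functional = set(s) & up
    views = {local_views[p] for p in functional}
    return next(iter(views)) if len(views) == 1 else None


def self_determined_sets(procs, local_views, up):
    """All non-empty subsets S of procs with S == GpView_S at this cut."""
    found = []
    for k in range(1, len(procs) + 1):
        for s in combinations(sorted(procs), k):
            if group_view(s, local_views, up) == frozenset(s):
                found.append(frozenset(s))
    return found


def satisfies_uniqueness(procs, local_views, up):
    """True when at most one set satisfies S == GpView_S along the cut."""
    return len(self_determined_sets(procs, local_views, up)) <= 1
```

A partition-like snapshot, in which {p, q} and {r} each agree internally, fails the check while r is still functional and passes once r is down.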


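GMP-3 (Sequence) All processes exhibit the same sequence of local views, provided the
views are defined. Moreover, there is a sequence of cuts along which each local view is
a system view:

$$
\bigwedge_{p} \bigwedge_{0 < x} \Diamond \bigwedge_{q}
\Big[ \big( \text{IN-LOCAL}^x_p(q) \Rightarrow \text{DOWN}_q \lor ( LocalView_q() = LocalView_p() = LocalView^x_p ) \big)
\land \big( ( \neg\text{IN-LOCAL}^x_p(q) \land \bigvee_{y<x} \text{IN-LOCAL}^y_p(q) ) \Rightarrow \Box\, \text{NOTDEF'D}(LocalView^x_q) \big) \Big]
$$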


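GMP-4 (Liveness) For each event $faulty_p(q)$ (respectively, $operating_p(q)$) and each process
$p \in GpView^x$, eventually either p is removed from the group view, or q is removed from
it (respectively, added to it):

a. $\text{FAULTY}_p(q) \land \text{IN-GP}_p \Rightarrow \big( \Diamond\text{OUT-GP}_q \lor \Diamond\text{OUT-GP}_p \big)$

b. $\text{OPERATING}_p(q) \land \text{IN-GP}_p \Rightarrow \big( \Diamond\text{IN-GP}_q \lor \Diamond\text{OUT-GP}_p \big)$.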

GMP-3 is equivalent to requiring that each local view eventually becomes a group view.
The presence and placement of the $\Diamond$ modality forces a group view to exist along some
consistent cut in an execution. This, too, is why we cannot bind $LocalView_q$ to a version
number. Local views indexed by version numbers are static: the composition of a process's
xth local view will never change. If c is the witness cut for $\Diamond$, then omitting the version
superscript forces $LocalView_q(c)$ (for every $q \in LocalView^x_p$) to be identical to $LocalView^x_p$ at
least along c. Had we included the version number in the equality clause, we would not have
been able to conclude that group views necessarily exist, since the local views need not have
been identical simultaneously.

Finally, since each process executes at least one event between local views x and x + 1,
the corresponding group views will exist along cuts that are related by $\ll$, so it makes sense
to talk about the xth group view, which we denote $GpView^x$.


4  A  Protocol  Solving  Strong  GMP 

Our solution to Strong GMP, the Strong Group Membership Protocol (hereafter S-GMP),
is asymmetric and centralized: a distinguished core member, denoted mgr, coordinates
updates among all core members' local views. In a symmetric, distributed solution [14, 3]
all core members would behave identically and make updates independently. We chose the
centralized approach for two reasons: it requires only $O(n)$ point-to-point messages, instead
of $O(n^2)$, and it is a simpler paradigm within which to reason. While mgr's failure is more
troublesome to handle than an outer (non-mgr) member's, the benefits of the centralized
approach, coupled with the low probability of the mgr failing, outweigh these concerns.

Figure 2: Two-Phase Communication Structure of Simple S-GMP (Phase I: M-sub(-q); Phase II: M-com(-q)).

An important aspect of S-GMP is the lack of restrictions on changes to a group view.
Specifically, there is no upper limit on the number of processes one can add to $GpView^x$ to
form $GpView^{x+1}$; if removing processes from $GpView^x$, the upper limit is the size of the
largest minority subset of $GpView^x$. This flexibility broadens fault-tolerance, and enables a
membership service defined by Strong GMP to adapt quickly to changes in system load. The
Appendix contains the complete protocol.
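The bound on removals can be made concrete with a small sketch. We assume here that "minority" means strictly fewer than half of the current view, so the largest minority of an n-member view has floor((n - 1) / 2) members; the function names are hypothetical and the check is not taken from the protocol in the Appendix.

```python
def max_removable(view_size: int) -> int:
    """Size of the largest minority subset of a view with view_size members."""
    if view_size < 1:
        raise ValueError("a group view has at least one member")
    return (view_size - 1) // 2


def is_legal_update(view: set, add: set, remove: set) -> bool:
    """Additions are unbounded; removals must form a minority of the current view."""
    return remove <= view and len(remove) <= max_removable(len(view))
```

For a five-member view, for instance, any number of processes may be added in one update, but at most two may be removed.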

4.1 Simple S-GMP

While  we  assume  in  this  section  that  mgr  does  not  fail,  the  protocol  we  present  has  a 
more  complicated  communication  structure  and  degree  of  coordination  than  this  assumption 
warrants.  Indeed,  if  we  knew  mgr  could  not  fail  we  would  already  have  a  single,  fault-tolerant 
process.  Anticipating  mgr's  failure  simplifies  describing  reconfiguration  in  Section  4.2. 

When mgr suspects an outer member's (or some subset of outer members'), say q's,
failure, it initiates a two-phase update algorithm. In Phase I (Figure 2) mgr proposes q's
removal by multicasting a submit message, M-sub(-q), to the members of its local view
(multicasts are not failure-atomic). mgr then waits for each member to respond, or to start
believing a member faulty. In this way, at the end of Phase I, all core members that mgr does
not believe faulty believe q faulty. If mgr receives responses from a majority subset of its
current local view, it multicasts a commit message, M-com(-q), in Phase II;6 mgr must block
if it does not receive a majority response. If local views are identical at the beginning of this
protocol, because mgr is a single process, local views are identical at the end of it.

The  submit  message  coordinates  belief  among  the  core  in  q's  faultiness;  the  commit 
message  tells  outer  members  that  the  group  has  reached  agreement  on  q's  failure  and  that 
they  should  now  remove  q  from  their  own  local  views.  However,  because  mgr  does  not  receive 
responses  from  outer  members  it  believes  faulty,  it  cannot  know  whether  these  members 
received its submit message. From mgr's perspective, these members may not be aware
of  the  current  update  to  the  group  view,  rendering  core-wide  agreement  on  the  new  view 
contingent  upon  the  subsequent  removal  of  these  ‘faulty’  members.  The  Gossip  assumption 
ensures  that  operational  outer  processes  become  aware  of  such  contingencies. 

When adding to the group view, mgr sends the new process(es) p a State-Xfer message
giving p permission to join and informing p of all relevant system state. mgr awaits a reply
(or suspicion of p's faultiness) and then multicasts the commit message to the entire new
group. To simplify bookkeeping, new members begin with local version equal to the group
version in which their addition resulted.
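The control flow of a single mgr-initiated removal can be sketched as follows. This is a simplified model, not the protocol from the Appendix: message delivery is abstracted into the set `responded`, and we assume (the text does not say) that mgr counts itself toward the majority of its local view. All names are hypothetical.

```python
from enum import Enum


class Outcome(Enum):
    COMMITTED = "committed"
    BLOCKED = "blocked"


def simple_sgmp_remove(view, mgr, q, responded):
    """One mgr-initiated removal of q (mgr assumed not to fail).

    view      -- mgr's current local view (a set of process names)
    responded -- outer members that answered M-sub(-q); every other outer
                 member is one mgr came to believe faulty while waiting
    Returns (outcome, new_view, suspected).
    """
    outer = view - {mgr}
    acks = (responded & outer) - {q}
    suspected = outer - acks               # includes q itself
    # Assumption: mgr counts itself toward the majority of its local view.
    if 2 * (len(acks) + 1) <= len(view):
        return Outcome.BLOCKED, set(view), suspected
    # Phase II: M-com(-q); committing members drop q from their local views.
    return Outcome.COMMITTED, view - {q}, suspected
```

The members left in `suspected` other than q are exactly those whose later removal the new view is contingent upon.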

4.2  Full  S-GMP 

When mgr is believed to have failed, the outer members execute a reconfiguration algorithm to
select a new coordinator and, if necessary, reestablish the group view. Local view agreement
may be lost, for example, when mgr fails in the middle of an M-com() multicast. In Figure 3
local views differ along the second cut so the group view is undefined.

Reconfiguring successfully involves solving two problems: succession (which process(es)
should initiate reconfiguration and which should assume the mgr role at the end) and
progression (which update a reconfiguration initiator should propose to resolve core members'
inconsistencies).

A  reconfigurer  must  be  able  to  determine  the  last  defined  group  view  and  propagate  the 
correct  proposal  for  the  succeeding  group  view.  Extrapolating  from  Figure  3,  we  see  that 
proposals  may  also  be  partially  known  among  the  current  group  view. 

The most difficult aspect of reconfiguring involves invisible commits. An invisible commit
occurs when the only processes receiving a commit message fail, or are believed faulty by the
rest of the group. This is significant for reconfiguration: while no subsequent reconfigurer
will ever know whether these processes committed the change to their local views, GMP-3
requires that if an invisible commit did occur, the remaining core members must behave
consistently. It is imperative, then, that every invisibly committed update be detectable by
every reconfigurer. We can ensure this only if all initiators (whether mgr or a reconfigurer)
attempting to install the xth group view vie for the requisite majority responses from among
the same set of processes.

6 Typically, a phase of communication consists of a multicast from a single process to a group of processes
and their responses back to the initiator. In fact, Simple S-GMP is one-and-one-half phases, but this is
awkward.

Figure 3: mgr's Failure Results in Undefined Group View.

4.2.1  The  Reconfiguration  Algorithm 

Unlike the mgr-initiated algorithm, reconfiguration requires three phases in the worst case.
This  is  an  outgrowth  of  the  Sequence  requirement  (GMP-3)  and  the  possibility  of  invisible 
commits;  we  discuss  this  below  in  more  detail.  For  simplicity,  we  present  the  algorithm  here 
as  always  using  three  phases  ([17]  discusses  the  cases  when  two  suffice). 

For p with ver(p) = x - 1, let $NextUpdate_p$ be the tuple [<v,x>, rank(i)]_p, where v is the
value p is waiting to commit to form $LocalView^x_p$, and rank(i) is the rank (in $LocalView^{x-1}_p$)
of the initiator that submitted <v,x>. $LastCommit_p$ is the value p committed to form
$LocalView^{x-1}_p$. state(p) is p's local state information: ver(p), $NextUpdate_p$, and $LastCommit_p$.

In the first phase, the initiator, r, multicasts a reconfiguration interrogate message,
R-int(state(r)), to its local view. The reconfigurer then awaits responses from the outer
processes, or its own belief in their faultiness. Upon receiving R-int(state(r)), a core member
that is lagging behind r adopts r's local state as its own (committing the appropriate
value, and so forth). Every core member, whether it just updated its local state or not,
responds to the reconfigurer with its current local state, state().

If a majority respond, then r uses the information it received to determine an update
value, say v, and version number, say x, whose execution would result in a new group
view. The initiator multicasts this event as the Phase II reconfiguration submit message,
R-sub(<v,x>). After obtaining a second majority response acknowledging R-sub(<v,x>),
r multicasts the Phase III reconfiguration commit message, R-com(<v,x>). Again, majority
responses to R-int(state(r)) and R-sub(<v,x>) are essential in maintaining GMP-2 and
GMP-3; without either, r must block. If the committed operation is the addition of a set
of processes, say Q, then each $q \in Q$ must respond with any pending $NextUpdate_q$ value it may
have (Figure 4). This is necessary to maintain GMP-2 and GMP-3 in cases where Q had
already joined at the behest of a previous initiator, for example mgr. If mgr had been able
to propose an additional update to Q, say M-sub(<+R, x+2>), it may also have been
able to commit <+R, x+2> invisibly to r and Q.
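The three-phase structure and the Phase I state adoption can be sketched as below. This is a deliberately small model under our own assumptions: the `State` record, its field layout, and the function names are hypothetical, the sets of respondents are taken to include the reconfigurer itself, and the choice of the update value is left out entirely.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class State:
    ver: int                                      # ver(p)
    next_update: Optional[Tuple[str, int, int]]   # (value, version, rank of submitting initiator)
    last_commit: Optional[str]                    # value committed to reach version ver


def on_interrogate(mine: State, initiator: State) -> State:
    """Phase I at a core member: adopt r's state if lagging; the result is the reply."""
    return initiator if mine.ver < initiator.ver else mine


def is_majority(respondents: set, view: set) -> bool:
    """True when the respondents within view outnumber the rest of view."""
    return 2 * len(respondents & view) > len(view)


def reconfigure(view: set, int_acks: set, sub_acks: set) -> str:
    """Which step the reconfigurer reaches, given who answered R-int and R-sub."""
    if not is_majority(int_acks, view):
        return "blocked in Phase I"
    if not is_majority(sub_acks, view):
        return "blocked in Phase II"
    return "R-com multicast in Phase III"
```

The two majority tests correspond to the two points at which r must block.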

Definition An update initiator (either mgr or a reconfigurer) is successful for a submission
(M-sub(<v,x>) or R-sub(<v,x>)) if a majority subset of the initiator's local view respond
to the submission. In this case, we say the submitted value is stable. ■

A  successful  initiator  is  able,  if  it  does  not  fail,  to  commit  the  value  it  submitted.  In  this 
light,  GMP-3  means  that  all  successful  version  x  initiators  must  make  identical  proposals. 
The  local  state  information  collected  during  reconfiguration  Phase  I  must  allow  a  reconfigurer 
to  determine  the  correct  update  proposal  unambiguously. 

In S-GMP all successful reconfigurers attempting to install (or complete the installation
of) the xth group view propagate mgr's proposal if they become aware of it; they propose
mgr's removal if they do not. Unfortunately, as Figure 5 makes clear, asynchrony and
inopportune failures can result in there being two different proposals for the same instance
of the group view. There, reconfigurer r1 does not learn of mgr's proposal, <v,x>, and so
proposes mgr's removal for version x (as dictated by Procedure DetermineProposal in the
Appendix). The subsequent reconfigurer r2 learns of both proposals and must then decide
which to propagate. Correctness requires that only one of the two proposals become stable,
and that any non-blocking reconfigurer be able to determine which one it is by the end of
Phase I (we discuss how a reconfigurer determines this in Section 4.2.3). By propagating the
stable submission, a reconfigurer forces the entire group to act consistently with any invisible
commits.

4.2.2  Rules  of  Succession 

We solve the succession problem by imposing a deterministic, linear ranking on core members
based on seniority in the group view: 'older' core members are ranked higher. This is sensible
only if group views are unique and agreed-upon. Let rank(p) denote p's rank. Whenever a
process is removed from the group view, the ranks of all higher-ranked processes are decreased
by one.

Figure 4: A Situation Requiring New Core Members to Report NextUpdate to Initiator
(mgr's commit message cannot reach any processes except Q).

Figure 5: Reconfigurer r2 Learns of Conflicting Proposals for $GpView^x$.
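The seniority ranking described in this subsection can be sketched by deriving ranks from an ordered list of core members. The representation (a list ordered from newest to oldest, with ranks 1..n) and the function names are our own assumptions; the text fixes only that older members rank higher and that removing a process lowers the rank of every higher-ranked process by one.

```python
def ranks(seniority):
    """seniority lists core members from newest to oldest; older members rank higher."""
    return {p: i + 1 for i, p in enumerate(seniority)}


def remove_member(seniority, gone):
    """Drop one member; recomputing ranks then shifts every more senior member down by one."""
    return [p for p in seniority if p != gone]
```

With members q, p, mgr in increasing seniority, removing p leaves q at rank 1 and lowers mgr from rank 3 to rank 2.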

A process initiates reconfiguration when it believes all others ranked higher than itself
are faulty. That is, given cut c and $LocalView_p(c)$,

$$\text{INITIATE}(p) = \bigwedge_{q \in LocalView_p(c)} \big( (rank(q) > rank(p)) \Rightarrow \text{FAULTY}_p(q) \big)$$
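The predicate translates directly into code. In this sketch `rank` is a dictionary of ranks and `believed_faulty` is the set of processes p currently believes faulty; both names, and the function name, are hypothetical.

```python
def should_initiate(p, local_view, rank, believed_faulty):
    """INITIATE(p): every member of p's local view ranked above p is believed faulty by p."""
    return all(q in believed_faulty
               for q in local_view
               if rank[q] > rank[p])
```

Note that the conjunction is vacuously true for the highest-ranked member, which has no one ranked above it.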

While initiating reconfiguration on INITIATE(p) can lead to multiple reconfigurations, it
guarantees at least one process will undertake reconfiguring. Consider Figure 6 in which
$rank(mgr) = \rho$, $rank(p) = \rho - 1$, and $rank(q) = \rho - 2$, and both p and q believe mgr faulty.
In the second scenario q expects p, which has crashed, to initiate a reconfiguration; any
solution must ensure that q eventually comes to suspect p faulty. In S-GMP, q times out
waiting for p's R-int() message, surmises $\text{FAULTY}_q(p)$, and then initiates reconfiguration.
In the third scenario both p and q initiate reconfigurations. S-GMP must also ensure view
uniqueness in the face of multiple, concurrent reconfiguration attempts.
Figure 6: Initiating Reconfiguration: $\text{FAULTY}_p(mgr) \land \text{FAULTY}_q(mgr)$, $rank(mgr) = rank(p) + 1 = rank(q) + 2$.

Figure 7: Majority of Responses Needed.

4.2.3  Rules  of  Progression 

To understand the difficulties in reconfiguring we examine GMP-2 and GMP-3 more closely.
Uniqueness requires that at most one group view exists along any consistent cut. In the
situation depicted in Figure 7, Q and R are subsets of $GpView^x$, and q and r are both
initiating reconfiguration. If all members of Q believe r faulty, the Disconnect assumption
means they will receive none of r's messages. Analogous statements hold for the members of
R regarding q. If r's proposal differs from q's then the members of R will commit a different
value than the members of Q. If $R \cup \{r\}$ eventually remove all of $Q \cup \{q\}$, and $Q \cup \{q\}$
eventually remove all of $R \cup \{r\}$, two distinct group views will exist.

Naively, it would appear that the majority requirement suffices to ensure Uniqueness.
However, as Figure 3 makes clear, initiators that may end up installing (submitting and
committing) the same group version need not begin reconfiguration with identical local
views and so may be seeking majority approval from different sets of processes.

Figure 8: Value <v,x> Committed Invisibly to p, q, and r.

Reconfiguration  Phase  I  Responses 

Outer processes' responses to R-int(state(r)) must allow r to determine the nature and
composition  of  all  local  view  inconsistencies,  including  inconsistencies  involving  core  members 
that  did  not  respond  to  r.  Local  view  information  alone  is  insufficient  to  satisfy  GMP-3 
(Sequence)  as  invisible  commits  are  not  detectable. 

In Figure 8, <v,x> is committed invisibly to p, q, and r. Since all three have identical
local views, r will not detect the actual discrepancy. However, p is aware of mgr's intention
to commit <v,x>, and p can envision a situation in which mgr succeeded in doing so
and then failed (in this case, the situation that actually occurred). If p were to forward
mgr's intention to commit <v,x>, r would then envision the same situation and propagate
<v,x> as its Phase II submission. Thus, in addition to its local view, an outer member
must also report how it expects to change its local view next.

We have described how a reconfigurer may discover two different values were proposed
for the same group version. In S-GMP the reconfigurer propagates the value proposed by the
process of least rank among those making proposals (Procedure GetStableProposal in the
Appendix). Proposition 5.7 proves this choice ensures GMP-2 and GMP-3.
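The selection rule just stated, taken in isolation, amounts to a minimum over the reported proposals for a given version. The sketch below is not Procedure GetStableProposal from the Appendix, which is not reproduced in this section; the `Proposal` record and `pick_proposal` are hypothetical names used only to illustrate the rule.

```python
from typing import Iterable, NamedTuple


class Proposal(NamedTuple):
    value: str        # e.g. "-mgr" or "+Q"
    version: int      # group version the value would install
    rank: int         # rank of the initiator that submitted it


def pick_proposal(reported: Iterable[Proposal], version: int) -> Proposal:
    """Among proposals reported for `version`, return the one from the least-ranked initiator."""
    candidates = [p for p in reported if p.version == version]
    if not candidates:
        raise ValueError("no proposal reported for this version")
    return min(candidates, key=lambda p: p.rank)
```

In the situation of Figure 5, the inputs would be the proposals that the Phase I respondents report through their NextUpdate values.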
4.2.4 Membership Service Startup

Our approach depends on the initial group view being unique, and this is difficult to guarantee
in asynchronous systems. We use a heuristic borrowed from previous versions of the
Isis system. Briefly, "cold-start" of the MRM is limited to a small, known set of sites. To
cold-start, a process first queries these locations to determine whether any others have begun
the cold-start procedure. It continues the cold-start procedure only if it determines no others
have begun, or if it "outranks" all processes that have concurrently begun cold-starting.
Because we iterate this procedure, the probability that two cold-starting processes remain
unaware of each other diminishes with each round. After a suitable number of successful
rounds a process determines it should start the MRM. Although probabilistic, we find this
scheme highly successful in practice.7

5  Correctness 

The proof that S-GMP correctly solves Strong GMP is inductive. In this section we present the
more interesting theorems of the inductive step. We show that if $GpView^{x-1}$ is uniquely defined,
S-GMP results in exactly one value being committed among the members of $GpView^{x-1}$ to
obtain $GpView^x$. That S-GMP satisfies GMP-2 and GMP-3 follows from there.

The major steps in the proof are, first, showing that all initiators attempting to install
$GpView^x$ do so starting from $LocalView^{x-1}$. As a result, all such initiators compete for
majority approval from the same set of processes. We use this result when we show that a
reconfigurer knows which of the two proposals it may learn of could not have been stable.
While the other proposal may not, in actuality, be stable, by choosing to propagate it, the
reconfigurer cannot possibly act inconsistently with the subset of the core that is 'invisible'
to it. Stated another way, we show that all successful initiators propose the same value for
$GpView^x$, and that this value is the only one that can possibly be committed.

For  brevity,  we  do  not  prove  all  propositions;  the  full  proof  of  correctness  is  in  [17]. 

5.1  The  Inductive  Step 

As in Section 4.2.1, $NextUpdate_p$ is the tuple [<v, ver(p)+1>, rank(i)]_p. For each p, Gossip,
Disconnect, and INITIATE() mean that $NextUpdate_p$ is always the proposal of the lowest-ranked
initiator from which p received proposals for version ver(p) + 1.

7 In several years of wide use no problems have ever been traced to the restart scheme. Note also that
limiting cold-start to a single, known site suffices to guarantee uniqueness of the initial view, but unfortunately
this scheme is now vulnerable to a liveness problem: we may be unable to restart the system after a
crash.
For process r multicasting message m, Acks(r, m) is the set of processes from which r
receives a message acknowledging, or in response to, m. Let

$$Ahead_r \;\stackrel{\mathrm{def}}{=}\; \{\, p \mid p \in Acks(r, \text{R-int}(state(r))) \land (ver(p) > ver(r)) \,\}$$
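As a small illustration, the set of respondents that are ahead of r can be computed from the versions they report. Here `responses` is a hypothetical dictionary mapping each process in Acks(r, R-int(state(r))) to the local version it reported.

```python
def ahead(r_ver, responses):
    """Respondents whose reported local version exceeds ver(r)."""
    return {p for p, ver in responses.items() if ver > r_ver}
```

Processes that did not respond do not appear in `responses` and so can never be counted as ahead.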

Proposition 5.1 If r is a reconfiguration initiator with ver(r) = x - 1, then for every p
responding to R-int(x - 1), $x - 1 \le ver(p) \le x$. ■

For process p, let $Faulty_p = \{\, q \mid \text{IN-LOCAL}_p(q) \land \text{FAULTY}_p(q) \,\}$.8 We say $GpView^{x-1}$ is
p-defined along cut c if p knows at c that every process in $LocalView_p(c) - Faulty_p$ has defined
its (x - 1)st local view. Of course $GpView^{x-1}$ may not be defined globally, but from p's point
of view, $GpView^{x-1}$ is (or has been) defined. For a reconfigurer r, $GpView^{x-1}$ is r-defined
at the end of Reconfiguration Phase I if every process in $Acks(r, \text{R-int}(x-1)) - Faulty_r$
reported a local version at least as large as x - 1.

Proposition 5.2 Let r be a reconfiguration initiator. Then r proposes version x if and only
if $GpView^{x-1}$ is the most recent (i.e. highest-numbered) r-defined system view at the end of
Reconfiguration Phase I.

Proof Follows from analyzing procedure DetermineProposal in Section 7. ■

From  Proposition  5.2  we  infer  that  an  initiator  attempting  to  install  version  x  has  local 
version  either  x  —  1  or  x.  We  now  show  it  can  only  have  local  version  x  —  1. 

Proposition 5.3 For any initiator, r, if r proposes <v,x>, then ver(r) = x - 1.

Proof The proof is trivial when r = mgr, so suppose r is a reconfiguration initiator, with
$ver(r) \ge x$. When r multicasts R-int(state(r)), any process p lagging behind r adopts r's
local state as its own.9 Thus, when it responds to r's interrogate message, state(p) = state(r),
making ver(r) the most recent r-defined version. From Proposition 5.2 r would then propose
some value for version ver(r) + 1, and not version x. On the other hand, ver(r) < x - 1 is
impossible if $GpView^{x-1}$ is the most recent r-defined view. ■

Hereafter, we use sub() and com() to denote generic submit and commit messages irrespective
of the initiator's role (mgr or reconfigurer).

8 $Faulty_p$ is implicitly indexical.

9 It turns out that any process r does not believe faulty at the end of Phase I will have local version at
least ver(r) - 1 when it receives R-int(state(r)).
[Figure 9; recoverable label: r is waiting to commit $\langle v_{x+1}, x+1 \rangle$.]


To illustrate the difficulty in proving Sequence (GMP-3) consider the following situation,
depicted in Figure 9. Let r be a reconfigurer with ver(r) = x, and let Acks(r, R-int(x)) (we
use R-int(ver(r)) rather than R-int(state(r)) to get explicit reference to r's local version) be
a majority subset of $LocalView^x$. Proposition 5.1 means the largest version number observed
among r's respondents is x + 1, so suppose $Ahead_r$ is non-null and let p be a process from
which some member of $Ahead_r$ received com($\langle v_{x+1}, x+1 \rangle$). Suppose further that p also
proposed a value, $\langle v_{x+2}, x+2 \rangle$, for version x + 2 to which every member of $Ahead_r$
responded. Making matters worse, r can imagine all the processes that did not respond to
its own R-int(state(r)) message may have responded to p's sub($\langle v_{x+2}, x+2 \rangle$). It may
then be the case that $Ahead_r \cup \overline{Acks(r, \text{R-int}(state(r)))}$ (and $v_{x+1}$, if it is an add()) form
a majority subset of $GpView^{x+1}$, thereby allowing p to commit view x + 2. Trouble arises
if $\overline{Acks(r, \text{R-int}(state(r)))}$ (and $v_{x+1}$ and $v_{x+2}$, if both are add() operations) is a majority
subset of $GpView^{x+2}$; neither r nor any process in Acks(r, R-int(state(r))) can know what
value p would propose for view x + 3.

Proposition 5.4 shows that when r is successful and $\langle v_{x+1}, x+1 \rangle$ is a remove() operation,
no previous initiator (like p) can commit a version greater than x + 1. Proposition 5.5
shows that when $\langle v_{x+1}, x+1 \rangle$ is an add() operation, it is possible for p to continue committing
new group views and for r to lag behind p. However, if both are successful for a
given group version, both commit the same value. These propositions address exactly the
situation when it appears S-GMP could violate GMP-2 and GMP-3: when two initiators
are successful for the same group version. Propositions 5.4 and 5.5 prove that even when
initiators do not vie for majorities from among the same set of core members (their local
views differ), S-GMP is safe.

Proposition 5.4 Let r be a reconfiguration initiator with ver(r) = x. Let $Ahead_r \subseteq Acks(r, \text{R-int}(x))$
report local version x + 1, and let p be a process from which some member
of $Ahead_r$ received com($\langle v_{x+1}, x+1 \rangle$). If Acks(r, R-int(x)) is a majority subset of
$LocalView^x$ and $\langle v_{x+1}, x+1 \rangle$ is the removal of a set of processes from $GpView^x$, then p
cannot be successful for any view numbered higher than x + 1.

Proof Let $q \in Ahead_r$ and let p be as described. Since q received R-int(x) from r after
com($\langle v_{x+1}, x+1 \rangle$) from p, it must be that rank(r) < rank(p), so $\text{FAULTY}_q(p)$ holds for
every such q in Acks(r, R-int(x)) (by Gossip). As a result, the initiator p can be successful for
x + 2 if and only if $\overline{Acks(r, \text{R-int}(x))}$ is a majority subset of $LocalView^{x+1} = LocalView^x - v_{x+1}$.

Observe that $LocalView^x = Acks(r, \text{R-int}(x)) \cup \overline{Acks(r, \text{R-int}(x))}$, and that r cannot
have received ack(R-int(x)) responses from the members of $v_{x+1}$;10 in other words,
$v_{x+1} \subseteq \overline{Acks(r, \text{R-int}(x))}$. Thus p can commit $LocalView^{x+2}_p$ if and only if
$(\overline{Acks(r, \text{R-int}(x))} - v_{x+1})$ is a majority of $LocalView^{x+1}$. Let $\alpha = |Acks(r, \text{R-int}(x))|$
and $\bar{\alpha} = |\overline{Acks(r, \text{R-int}(x))}|$. Then initiator p is successful for $\langle v_{x+2}, x+2 \rangle$ if and only if

$$
\frac{\bar{\alpha} - |v_{x+1}|}{|LocalView^{x+1}_p|}
= \frac{\bar{\alpha} - |v_{x+1}|}{|GpView^x| - |v_{x+1}|}
= \frac{\bar{\alpha} - |v_{x+1}|}{\alpha + \bar{\alpha} - |v_{x+1}|}
> \frac{1}{2},
\quad\text{that is,}\quad
\bar{\alpha} - |v_{x+1}| > \alpha,
$$

which implies $\bar{\alpha} > \alpha$, contradicting the assumption that Acks(r, R-int(x)) is a majority subset of $GpView^x$. ■

10 Reconfigurer r must have received p's proposal to remove $v_{x+1}$ or else q would not have received r's
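The counting argument can be checked exhaustively for small sizes. In this sketch `alpha` stands for |Acks(r, R-int(x))|, `alpha_bar` for the number of members of the xth view that did not respond to r, and `removed` for the size of the removed set; the function names are our own. The search confirms that p's success condition never holds while the respondents form a majority and the removed set lies among the non-respondents.

```python
def p_can_succeed(alpha, alpha_bar, removed):
    """p needs a majority of the (x+1)st view from the non-respondents that survive the removal."""
    survivors = alpha_bar - removed
    new_view_size = alpha + alpha_bar - removed
    return 2 * survivors > new_view_size


def counterexamples(limit):
    """Search all sizes below limit for a case where p succeeds despite r's majority."""
    found = []
    for alpha in range(1, limit):
        for alpha_bar in range(0, alpha):             # Acks is a majority: alpha > alpha_bar
            for removed in range(0, alpha_bar + 1):   # removed set lies among the non-respondents
                if p_can_succeed(alpha, alpha_bar, removed):
                    found.append((alpha, alpha_bar, removed))
    return found
```

Dropping the majority assumption makes success possible again, as with 2 respondents, 5 non-respondents, and 1 removal.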


Proposition 5.5 Let r be a reconfiguration initiator with ver(r) = x. Let $Ahead_r$ be non-null
and let p be a process that sent com($\langle v_{x+1}, x+1 \rangle$) to some member of $Ahead_r$. If r is
successful for $\langle v_{x+1}, x+1 \rangle$ and p later submits $\langle v_{x+2}, x+2 \rangle$, then either r or p,
but not both, can be successful for version x + 2.

Proof  GMP-3 will be violated if p is able to commit <v_{x+2}, x+2> and r is able to
commit <-\overline{Acks(r, R-int(x))}, x+2>. We proceed by analyzing the messages arriving at
v_{x+1}.

(a) Consider Figure 10 (top diagram). The two-headed split-arrow message from r to
v_{x+1} represents the two possibilities for the arrival of r's commit message,

m = R-com(<v_{x+1}, x+1>) : M-sub(<-\overline{Acks(r, R-int(x))}, x+2>),

at v_{x+1}. p's commit message, com(<v_{x+1}, x+1>), to v_{x+1} is dashed because of the
possibility that it may not be received. We elide r's Phase II submit message.

Suppose the members of v_{x+1} receive m from r before they receive com(<v_{x+1}, x+1>)
from p. Since r's message gossips its belief in p's faultiness, the members of v_{x+1} will never
receive another message from p. In particular, the members of v_{x+1} will not receive p's
subsequent M-sub(<v_{x+2}, x+2>). We say r owns v_{x+1}.

Using α and \bar{α} as defined in Proposition 5.4, p is successful for version x + 2 if and only
if \bar{α} > α + |v_{x+1}|, and r is successful for version x + 2 if and only if α + |v_{x+1}| > \bar{α}. Both
conditions cannot hold.

(b) If the processes in v_{x+1} receive m from r after com(<v_{x+1}, x+1>) and before
M-sub(<v_{x+2}, x+2>) from p, the analysis is the same as in (a); r owns v_{x+1} once the
members of that set receive m.

(c) In the last case (Figure 10 bottom), p owns v_{x+1} if its M-sub(<v_{x+2}, x+2>) message,
gossiping p's belief in r's faultiness, arrives at v_{x+1} before r's R-com() message does. Then

R-int(x); had r not received and responded to p's proposal, p's commit message to q would have gossiped
faulty_p(r).
Figure 10: Case Analysis for Proposition 5.5. Top diagram, cases (a) and (b): r owns v_{x+1}. Bottom diagram, case (c): p owns v_{x+1}.
p is successful for version x + 2 if and only if \bar{α} + |v_{x+1}| > α, and r is successful for version
x + 2 if and only if α > \bar{α} + |v_{x+1}|. Again, both conditions cannot hold.  ■
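The counting argument of Propositions 5.4 and 5.5 can be checked numerically. The sketch below is illustrative only: the function names and the search bound are assumptions introduced here, not part of S-GMP. It writes `alpha` for |Acks(r, R-int(x))|, `alpha_bar` for the size of its complement, and `v` for |v_{x+1}|, and searches small values for a case in which both initiators succeed.

```python
from itertools import product


def successes(alpha: int, alpha_bar: int, v: int, owner: str) -> tuple[bool, bool]:
    """Return (p_successful, r_successful) for version x + 2.

    owner is "r" for cases (a) and (b) and "p" for case (c); the owner
    gains the v votes of the processes in v_{x+1}.
    """
    if owner == "r":
        return alpha_bar > alpha + v, alpha + v > alpha_bar
    return alpha_bar + v > alpha, alpha > alpha_bar + v


def both_can_succeed(limit: int = 12) -> bool:
    """Search small sizes for a counterexample to Proposition 5.5."""
    for alpha, alpha_bar, v in product(range(limit), repeat=3):
        for owner in ("r", "p"):
            p_ok, r_ok = successes(alpha, alpha_bar, v, owner)
            if p_ok and r_ok:
                return True
    return False
```

Each pair of conditions is a strict inequality and its reverse, so the search finds no counterexample for any bound.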

It remains to prove that when r learns of two version-identical proposals (how this situation
may arise was described in Section 4, Figure 5), the proposal submitted by the initiator
of least rank is the only one that could have been invisibly committed. That is, r correctly
identifies the initiator and proposal that could have been successful; as a result r cannot
act inconsistently with any invisible commits. Referring to the S-GMP algorithm in the
Appendix, this necessity arises in determining v2 when Ahead_r ≠ ∅, and in determining v1 when
Ahead_r = ∅.

Let GpView^{x-1} be the most recent r-defined group view and define Submissions_r(x) to be
the set of proposed next updates for version x that r learns about in response to its R-int()
message:

Submissions_r(x) = { <v, x> | ∃ p ∈ Acks(r, R-int()) : NextUpdate_p = [<v, x>, rank(i)] for some initiator i }

We first describe the composition of Submissions_r(x), showing that every reconfigurer proposing
version x either propagates mgr's proposal for version x or proposes mgr's removal.
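As an illustration of the definition of Submissions_r(x), the following sketch collects the distinct version-x proposals from a set of Phase I responses. The `NextUpdate` record, the string encoding of updates, and the function names are assumptions made for this example; the paper only specifies that each response carries a NextUpdate tuple.

```python
from typing import NamedTuple, Optional


class NextUpdate(NamedTuple):
    value: str    # proposed update v, e.g. "-q" (hypothetical encoding)
    version: int  # proposed group version
    rank: int     # rank of the initiator that submitted <v, version>


def submissions(responses: dict[str, Optional[NextUpdate]], x: int) -> set[tuple[str, int]]:
    """Distinct <v, x> proposals reported by the respondents for version x."""
    return {
        (nu.value, nu.version)
        for nu in responses.values()
        if nu is not None and nu.version == x
    }


def is_bivalent(subs: set[tuple[str, int]]) -> bool:
    """A submission set is bivalent when it holds two distinct values."""
    return len(subs) == 2
```

For example, responses reporting "-q" (from mgr) and "-mgr" (from a reconfigurer) for the same version yield a bivalent set.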

Proposition 5.6  For all versions x, |Submissions_r(x)| ≤ 2.

Proof  Inspecting procedure DetermineProposal, different submissions for the same view
can arise only from mgr and from a reconfiguration initiator proposing mgr's removal. The
latter occurs if and only if the initiator did not learn of any outstanding proposal made by
mgr; that is, if Submissions_r(x) = ∅.

■ 

We say Submissions_r(x) is bivalent if it contains two distinct values. Corollary 5.1 follows
by examining procedure DetermineProposal in the Appendix. It shows that all reconfigurers
either propagate mgr's unique submission for view x or propose mgr's removal.

Corollary 5.1  Let r and r' be reconfigurers proposing version x. Then if both their
Submissions(x) sets are bivalent, they are identical:

\[
\bigl( |\mathrm{Submissions}_r(x)| = |\mathrm{Submissions}_{r'}(x)| = 2 \bigr)
\;\Rightarrow\;
\mathrm{Submissions}_r(x) = \mathrm{Submissions}_{r'}(x).
\]
With these preliminaries we can now prove that only one of these two proposals could possibly
have been committed (invisibly or otherwise), and that all reconfigurers can distinguish
which of the two it was. This proposition is vital to the inductive step: it shows that in
going from GpView^{x-1} to GpView^x one and only one value can be committed by any member
of GpView^{x-1}, as the same value is proposed by any successful initiator for the xth group
version.

Proposition 5.7  Let r be a reconfiguration initiator. If Acks(r, R-int(state(r))) is a majority
subset of LocalView_r and Submissions_r(x) is bivalent, then r can distinguish which of
the two values proposed could not have been committed invisibly.

Proof  Let r be as described, and let Submissions_r(x) = {<v, x>, <v', x>}. Let p be
the process of least rank among those reported to have submitted <v, x>, and let p' be the
process of least rank among those reported to have submitted <v', x>. r must decide which
of the two, p or p', could not have been successful for version x. We show that r chooses
correctly when it is the first bivalent reconfigurer for version x, then prove the proposition
inductively.

In order for either value to have been committed, its initiator must have garnered majority
approval from its local view for the submitted value. Since both p and p' make version x
submissions, both must have local version x - 1 (Proposition 5.3). Without loss of generality
assume rank(p) > rank(p'), and consider the possible roles p could have had:

a) If p were the mgr, its proposal M-sub(<v, x>) could not have reached a majority
subset of GpView^{x-1}; if it had, then p' would have learned of it from some process in
Acks(p', R-int(x - 1)). Since r is the first bivalent reconfigurer, Submissions_{p'}(x) would
have to be the singleton {<v, x>}, which p' would have propagated in DetermineProposal.

Thus, <v, x> is not stable because p cannot have been successful for <v, x>. Looking
at DetermineProposal, initiator r propagates <v', x> because it was submitted
by p', the initiator with least rank among those mentioned in Submissions_r(x).

b) If p ≠ mgr, it is successful for <v, x> if and only if Acks(p, R-sub(<v, x>)) is a
majority subset of GpView^{x-1}. Both p and p' were able to make proposals, so their first
response sets were majority subsets. Let A be their intersection:

A = Acks(p, R-int(x - 1)) ∩ Acks(p', R-int(x - 1)).

The gossip property and rank(p) > rank(p') mean that a majority of p's local view
believe it faulty upon receiving R-int(x - 1) from p'. Disconnect means that

recv_a(p, R-int(x - 1)) → recv_a(p', R-int(x - 1)) → faulty_a(p)   ∀a ∈ A.


The question is whether any a ∈ A receives R-sub(<v, x>) from p, which could only
happen before it receives R-int(x - 1) from p':

recv_a(p, R-sub(<v, x>)) → recv_a(p', R-int(x - 1)) → faulty_a(p).

Now, any such a would have forwarded <v, x> as part of NextUpdate_a to p' when
responding to R-int(x - 1), in which case either

1. Submissions_{p'}(x) is bivalent, violating the assumption that r is the first bivalent
reconfigurer, or

2. every process in Acks(p', R-int(x - 1)) reported <v, x> as its next pending update.
But in this case, Submissions_{p'}(x) would again be the singleton {<v, x>},
which p' would have propagated in DetermineProposal.

Thus, no a ∈ A received R-sub(<v, x>) from p, meaning the only processes that
could have are those in Acks(p, R-int(x - 1)) - A. This cannot be a majority subset of
GpView^{x-1} since it is disjoint from Acks(p', R-int(x - 1)), which is a majority subset.

We have just proven the base case for the proposition, namely when r is the first bivalent
reconfigurer (it is not hard to see that 'first' is meaningful and well-defined in the context
of successful initiators for a given group view). If r's proposal reaches a majority subset
of GpView^{x-1}, then the value it propagated will be chosen by the next reconfigurer to
get a majority response to its Reconfiguration Phase I interrogate message, as r would
be the submitter with least rank. If r's proposal does not reach a majority subset,
the next bivalent reconfigurer will nonetheless choose as r did and so propagate the
correct value.


■ 
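The final step of case (b) rests on a standard fact: two majority subsets of the same view must intersect, so a set disjoint from a majority subset cannot itself be a majority. A brute-force sketch of that fact follows; the function names are hypothetical and the check is only practical for small views.

```python
from itertools import combinations


def is_majority(subset: frozenset, view: frozenset) -> bool:
    """True when subset contains strictly more than half of view."""
    return 2 * len(subset) > len(view)


def disjoint_majorities_exist(view: frozenset) -> bool:
    """Enumerate all subsets of view and look for two disjoint majorities."""
    members = list(view)
    subsets = [
        frozenset(c)
        for k in range(len(members) + 1)
        for c in combinations(members, k)
    ]
    majorities = [s for s in subsets if is_majority(s, view)]
    return any(a.isdisjoint(b) for a in majorities for b in majorities)
```

The search returns False for every view, since two disjoint subsets each holding more than half the members would together exceed the size of the view.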

Corollary 5.2  If GpView^{x-1} is defined, there is at most one stably-defined proposal for
group version x.

Proof  Proposition 5.7 proves that GetStableProposal correctly chooses the only proposal
for a given group view that could have been committed invisibly to a reconfiguration initiator
when its Phase I response set is bivalent. When the set is univalent or empty, it is not hard
to see that DetermineProposal is safe. If this initiator reaches its commit stage, its proposal
is stably-defined and identical to the other stably-defined proposals for version x.

■


Theorem 5.1 (Identical Local Views)  If GpView^{x-1} is defined, then all members that
survive to define local version x have identical local xth views.

Proof  The result follows from Corollary 5.2; no process commits a local view for version
x that differs from any other process's version x because all proposals that can possibly
reach the commit stage are identical.

■

Note that Theorem 5.1 implies no temporal constraints on local views, merely that if p
ever defines an xth local view, and if q, too, ever defines an xth local view, then these two
are identical. It does not require LocalView_p^x and LocalView_q^x to exist together in some global
state. Thus, to prove S-GMP satisfies GMP-3 requires slightly more work.


5.2  Message  Complexity 

[16]  proves  S-GMP  (with  two  minor  modifications)  is  message  minimal  for  Strong  GMP. 
Moreover,  S-GMP  is  also  phase-minimal.  The  message-minimality  proof  gives  the  required 
direction  of  information  flow  as  well  as  the  content  of  each  message.  In  S-GMP  the  pattern 
of  required  communication  is  arranged  to  minimize  the  length  of  the  message-path  from  the 
beginning  of  the  update  algorithm  to  the  end.  For  example,  Figure  11  shows  two  ways  to 
organize  the  distributed  event  “send  a  message  to  every  process  in  S  and  collect  responses 
or  time-out”. 

Observe that the Phase I submit message is unnecessary if mgr knows a majority of the
non-faulty outer processes already believe a process, say q, faulty. In this light, a contingent
update, piggy-backed on a commit message, can serve as the submit message for the next
view change. We can thus compress successive instances of Simple S-GMP if mgr makes
known, when it multicasts the commit message, exactly how it plans to change the group
view next. In Figure 12, process q' crashes before responding to M-sub(-q), causing mgr to
suspect q' faulty. By appending M-sub(-q') to M-com(-q), mgr indicates that it wants to
remove q' from the just-formed group view. Outer processes respond to the piggy-backed
commit-submit message as they would respond to a plain submit message. The correctness
proofs (Propositions 5.5 and 5.7 in the previous section) require only slight modifications to
handle this optimization.
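A minimal sketch of the piggy-backed commit-submit message, assuming a simple record layout: the class and field names below are hypothetical and are not taken from the S-GMP implementation. An outer process applies the committed update to its view and treats the attached update, if any, as the submit for the next view change.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Update:
    op: str       # "+" to add a process, "-" to remove it
    process: str


@dataclass(frozen=True)
class CommitSubmit:
    commit: Update                        # update being committed now
    next_submit: Optional[Update] = None  # contingent update for the next view


def on_commit_submit(msg: CommitSubmit, local_view: set[str]) -> tuple[set[str], Optional[Update]]:
    """Apply the commit; return the new view and the update to pre-commit and acknowledge."""
    view = set(local_view)
    if msg.commit.op == "+":
        view.add(msg.commit.process)
    else:
        view.discard(msg.commit.process)
    return view, msg.next_submit
```

In the scenario of Figure 12, the message would carry the commit of -q together with the submit of -q', and the outer process would acknowledge the latter exactly as it would a plain submit.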
Figure 11: Two possible communication patterns accomplishing "send a message to every
process in S and collect responses or time-out." The figure marks the Timeout interval p is willing to wait for any response.

Figure 12: Compressing Successive Instances of Simple S-GMP.
When we can take advantage of compressing phases we gain substantially. Define
n_x = |GpView^x|, and let a_x be the number of processes added to GpView^x and r_x be the number
of processes removed from GpView^x. Then Y successive compressed updates (with no
reconfiguring) require an initial n_x submit messages, n_x - r_x acknowledgement messages,
a handshake of 2a_x State-Xfer and ack(State-Xfer) messages, and n_x - r_x + a_x commit
messages. To update the new GpView^{x+1}, there are n_x - r_x + a_x - r_{x+1} messages to
acknowledge M-sub(<v_{x+1}, x+1>), followed again by 2a_{x+1} messages for the State-Xfer /
ack(State-Xfer) handshake, and n_x - r_x + a_x - r_{x+1} + a_{x+1} commit messages, and so on. In total there are

\[
n_x + \sum_{k=x}^{x+Y} \Bigl( \bigl( n_x - \sum_{i=x}^{k} r_i + \sum_{j=x}^{k-1} a_j \bigr)
+ 2a_k
+ \bigl( n_x - \sum_{i=x}^{k} r_i + \sum_{j=x}^{k} a_j \bigr) \Bigr)
= (2Y + 3)\, n_x + \sum_{k=x}^{x+Y} a_k + 2 \sum_{k=x}^{x+Y} (x + Y - k + 1)\, \delta_k
\]

messages, where δ_k = a_k - r_k. When we cannot take advantage of piggy-backing, there are

\[
Y n_x + \sum_{k=x}^{x+Y} (x + Y - k)\, \delta_k
\]

additional messages.
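The message tally can be reproduced step by step. The sketch below is an illustration under stated assumptions, not the authors' analysis: it indexes updates from 0, takes `adds[k]` and `removes[k]` to be the numbers of processes added and removed by the k-th update, and compares the direct tally with a closed form obtained here by summing that tally. All function names are hypothetical.

```python
def compressed_messages(n_x: int, adds: list[int], removes: list[int]) -> int:
    """Tally messages for successive compressed updates, one update at a time."""
    total = n_x          # initial submit multicast to the starting view
    size = n_x           # size of the group view before the current update
    for a, r in zip(adds, removes):
        total += size - r    # acknowledgements from members not being removed
        total += 2 * a       # State-Xfer / ack(State-Xfer) handshake with joiners
        size += a - r
        total += size        # commit messages (carrying the next submit)
    return total


def closed_form(n_x: int, adds: list[int], removes: list[int]) -> int:
    """Closed form of the same tally; m is the number of updates."""
    m = len(adds)
    deltas = [a - r for a, r in zip(adds, removes)]
    weighted = sum((m - k) * d for k, d in enumerate(deltas))
    return (2 * m + 1) * n_x + sum(adds) + 2 * weighted


def extra_submits_without_piggyback(n_x: int, adds: list[int], removes: list[int]) -> int:
    """Separate submit multicasts needed for every update after the first."""
    extra = 0
    size = n_x
    for k, (a, r) in enumerate(zip(adds, removes)):
        if k > 0:
            extra += size    # submit sent to the view formed by earlier updates
        size += a - r
    return extra
```

With a starting view of five members, one join followed by one removal costs 28 messages when compressed, and 6 more when each update needs its own submit multicast.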

6  Conclusion 

We  have  described  an  approach  to  the  asynchronous  system  membership  problem  which 
provides  very  strong  distributed  consistency  guarantees,  and  yet  is  inexpensive  in  comparison 
even  to  less  powerful  membership  services.  Current  distributed  systems  lack  membership 
services,  forcing  application  designers  to  solve  this  problem  repeatedly  through  ad-hoc,  and 
often  inconsistent,  mechanisms.  As  technology  such  as  GMP  becomes  more  widely  available, 
we  believe  that  a  major  obstacle  to  reliable  distributed  software  development  will  have  been 
removed. 
7  Appendix:  The  S-GMP  Algorithm

We abbreviate "either add or remove Q" with ±Q. If Q is a set of process identifiers,
Mcast_p(Q, m) denotes the compound action ∀q ∈ Q : send_p(q, m). Mcast_p(Q, m) is an
indivisible action only in the sense that p does not execute any other events until all messages
are sent; it is not failure-atomic. The message ack(m) acknowledges receipt of message
m. We do not explicitly show gossiping or channel disconnect, but assume these are done
transparently.

Task: mgr

while (true)
    repeat
        GetUpdate(v1);
    until (v1 ≠ nil-id);

    Mcast_mgr(LocalView_mgr, M-sub(±v1));

    while (v1 ≠ nil-id)    /* Compressed algorithm loop. */
        forall p ∈ LocalView_mgr
            await either recv_mgr(p, ack(M-sub(±v1))) or faulty_mgr(p);
        if (majority of LocalView_mgr didn't respond)
            crash_mgr;

        DoCommit(v1, ±);    /* Update LocalView_mgr according to ±. */

        GetUpdate(v2);

        if (Joining new members)
            Mcast_mgr(v1, Join : State-Xfer);
            forall p' ∈ v1
                await either recv_mgr(p', ack(Join) : NextUpdate_p') or faulty_mgr(p');
                if (NextUpdate_p' ≠ ⊥)
                    v2 ← NextUpdate_p';

        Mcast_mgr(LocalView_mgr, M-com(±v1) : M-sub(±v2));
        v1 ← v2;

/* end mgr Task */
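The "await either ack or faulty, then require a majority" step recurs in every task. A minimal sketch of that decision follows, under two assumptions that are not stated in the pseudocode: `acked` and `suspected` are disjoint (each member is recorded under whichever event occurred first), and the caller includes itself in `acked` if it is meant to count toward the majority. The names are hypothetical.

```python
from enum import Enum


class Outcome(Enum):
    WAIT = "wait"      # some member has neither answered nor been suspected
    CRASH = "crash"    # responses came from no more than half the view
    COMMIT = "commit"  # a majority of the view responded


def phase_outcome(local_view: set[str], acked: set[str], suspected: set[str]) -> Outcome:
    """Decide how a phase ends once every member is accounted for."""
    pending = local_view - acked - suspected
    if pending:
        return Outcome.WAIT
    respondents = acked & local_view
    if 2 * len(respondents) > len(local_view):
        return Outcome.COMMIT
    return Outcome.CRASH
```

An initiator that cannot gather a majority stops rather than proceeding alone, which is the role of the crash statements in the tasks.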
Task: Outer Processes, p

recv_p(mgr, M-sub(±v1));

DoPreCommit(v1, ±);    /* Mark v1 faulty or operational. */

repeat
    send_p(mgr, ack(M-sub(±v1)));

    await either recv_p(mgr, M-com(±v1) : M-sub(±v2)) or faulty_p(mgr);
    if (not FAULTY_p(mgr))
        DoPreCommit(v2);
        DoCommit(v1, ±);
        v1 ← v2;
    else Wait-Reconfiguration();
until (v1 = nil-id);

/* end Outer Process Task */


Reconfiguration 

Let p have local version x - 1. For Reconfiguring, we use the following variables:

•  NextUpdate_p is a tuple of the form [<v, x>, i]_p, where <v, x> is the value p is waiting
to commit to form LocalView_p^x, and rank(i) is the rank (in LocalView_p^{x-1}) of the initiator
that submitted <v, x>. When p receives a submission it changes NextUpdate_p to
reflect the value proposed and the initiator proposing it.

•  LastCommit_p is the value p committed to form LocalView_p^{x-1}.

•  state(p) is the triple [ver(p), NextUpdate_p, LastCommit_p].

•  Ahead_r is the set of values reported committed for versions numbered greater than ver(r).
Initiator r receives these values in response to its R-int(state(r)) message. In actuality,
the only reported version in Ahead_r can be ver(r) + 1.

•  SubCurrent_r is the set of NextUpdate values r receives with proposed versions equal to
ver(r) + 1; SubAhead_r is the set with proposed versions greater than ver(r) + 1.
Task: Reconfiguration Initiator, r, with ver(r) = x

Mcast_r(LocalView_r, R-int(state(r)));
forall p ∈ LocalView_r
    await either recv_r(p, state(p)) or faulty_r(p);
if (majority of LocalView_r didn't respond) crash_r;

/* Determine the value and version to submit from the responses received. */
DetermineProposal(v1, ver, v2);

DoPreCommit(v1);

Mcast_r(LocalView_r, R-sub(<±v1, ver>));
forall p ∈ LocalView_r
    await either recv_r(p, ack(R-sub(<±v1, ver>))) or faulty_r(p);
if (majority of LocalView_r didn't respond) crash_r;

DoCommit(v1);
if (Joining new members)
    Mcast_r(v1, Join : State-Xfer);
    forall p' ∈ v1
        await either recv_r(p', ack(Join) : NextUpdate_p') or faulty_r(p');
        if (NextUpdate_p' ≠ ⊥)
            v2 ← NextUpdate_p';

Mcast_r(LocalView_r, R-com(<±v1, ver>) : R-sub(±v2));
mgr, v1 ← r, v2;

Begin mgr Task;
Task: Outer Reconfiguration, p

recv_p(r, R-int(state(r)));
if (rank(p) > rank(r))
    crash_p;

/* Catch up to r if necessary */
if (ver(p) < ver(r))
    DoCommit(LastCommit_r);
    state(p) ← state(r);

send_p(r, state(p));

await either recv_p(r, R-sub(<±v1, ver>)) or faulty_p(r);

if (not FAULTY_p(r))
    DoPreCommit(v1);
    send_p(r, ack(R-sub(<±v1, ver>)));

    await either recv_p(r, R-com(<±v1, ver>) : M-sub(±v2)) or faulty_p(r);

    if (not FAULTY_p(r))
        DoCommit(v1);
        mgr, v1 ← r, v2;
    else Wait-Reconfiguration();
else Wait-Reconfiguration();
/* Sets parameters proposal, version, and invisible. Let ver(r) = x. */
Procedure: DetermineProposal(OUT <proposal, version>, OUT invisible)

Ahead_r ← {[Ids, ver(p)]_p | ver(p) = (x + 1)};
SubAhead_r ← {[Ids, ver(p) + 1, rank(init)]_p | ver(p) = (x + 1)};
SubCurrent_r ← {[Ids, ver(p) + 1, rank(init)]_p | ver(p) = x};

if (Ahead_r ≠ ∅)
    /* Partially committed version x + 1. */
    proposal ← Ahead_r;
    GetStableProposal(invisible, SubAhead_r);
    version ← x + 1;
    return();

/* All respondents report the same local version. */
version ← x + 1;
if (SubCurrent_r is empty)
    proposal ← <-mgr, x + 1>;
    GetUpdate(invisible);
    return();

if (SubCurrent_r is a singleton)
    proposal ← SubCurrent_r;
    GetUpdate(invisible);
    return();

/* SubCurrent_r has two elements. */
GetStableProposal(proposal, SubCurrent_r);
GetUpdate(invisible);
return();

/* update-set has no more than two elements. */
Procedure: GetStableProposal(OUT <val, ver>, IN update-set)

<val, ver> ← the element of update-set with the lowest ranked initiator.
return();
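GetStableProposal amounts to a minimum over initiator rank. The sketch below illustrates that selection; the `Submission` record and the convention that a numerically smaller rank means a lower-ranked initiator are assumptions made for this example.

```python
from typing import NamedTuple


class Submission(NamedTuple):
    value: str           # proposed update, e.g. "-mgr" (hypothetical encoding)
    version: int         # proposed group version
    initiator_rank: int  # rank of the initiator that submitted the proposal


def get_stable_proposal(update_set: list[Submission]) -> tuple[str, int]:
    """Return <val, ver> of the submission whose initiator has the lowest rank."""
    if not 1 <= len(update_set) <= 2:
        raise ValueError("update-set must contain one or two elements")
    chosen = min(update_set, key=lambda s: s.initiator_rank)
    return chosen.value, chosen.version
```

For a bivalent set holding mgr's proposal and a reconfigurer's proposal to remove mgr, the reconfigurer has the lower rank, so its proposal is the one returned.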